Discovering Interpretable Topics in Free-style Text: Diagnostics, Rare Topics, and Topic Supervision
Abstract
Massive databases with free-style text fields are common in virtually all types of organizations, from hospitals to aviation companies to governmental agencies. Perhaps the most promising approaches for intelligent, automatic text analysis are “topic models”. Yet it is likely that every topic model generates at least some topics that do not correspond to anything human analysts can understand and act upon. In this dissertation, we begin by synthesizing the literature on text modeling and information retrieval. We argue that the research has evolved from a focus on fast search and document retrieval to the creation of interpretable models of entire corpora, i.e., databases. We also argue that the topic model literature has largely failed to address statistical issues relating to data limitations, rare topics, and their effects on topic model accuracy. Next, we clarify the limitations of perplexity, the standard measure of topic model accuracy, in cases where both topic interpretability and accuracy are important. Then, we propose new measures, including the “KL percentage”, that provide absolute evaluations of the accuracy or “informativeness” of all topics in the model. Computational experiments show that the proposed measures are more sensitive than perplexity and yield different estimates of data requirements.
Similar resources
A review of text mining approaches and their function in discovering and extracting a topic
Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...
Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically
In a recent pioneering approach, LDA was used to detect cross-cutting concerns (CCCs) automatically from software codebases. LDA, though successful in detecting prominent concerns, fails to detect many useful CCCs, including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source code. We pose this problem as that of discovering topics that rare...
Modeling with Structured Priors for Text-Driven Science by Michael J. Paul
Many scientific disciplines are being revolutionized by the explosion of public data on the web and social media, particularly in health and social sciences. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires...
Latent Dirichlet Allocation with Topic-in-Set Knowledge
Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise...
Finding document topics for improving topic segmentation
Topic segmentation and identification are often tackled as separate problems whereas they are both part of topic analysis. In this article, we study how topic identification can help to improve a topic segmenter based on word reiteration. We first present an unsupervised method for discovering the topics of a text. Then, we detail how these topics are used by segmentation for finding topical si...
Publication date: 2008